Abstract: In this work, we propose the kernel Pitman-Yor process
(KPYP) for nonparametric clustering of data with general spatial or 
temporal interdependencies. The KPYP is constructed by first intro- 
ducing an infinite sequence of random locations. Then, based on the 
stick-breaking construction of the Pitman-Yor process, we define a 
predictor-dependent random probability measure by considering that 
the discount hyperparameters of the Beta-distributed random weights 
(stick variables) of the process are not uniform among the weights, but 
controlled by a kernel function expressing the proximity between the 
location assigned to each weight and the given predictors. 

Index Terms: Pitman-Yor process, kernel functions, unsupervised
clustering



1 Introduction 

Nonparametric Bayesian modeling techniques, especially Dirichlet
process mixture (DPM) models, have become very popular in statistics
over the last few years for performing nonparametric density
estimation [1], [2], [3]. This theory is based on the observation that
an ordinary finite mixture model (clustering model) with an infinite
number of component distributions tends in the limit to a Dirichlet
process (DP) prior [2], [4]. The nonparametric Bayesian inference
scheme induced by a DPM model yields a posterior distribution on the
proper number of model component densities (inferred clusters) [5],
rather than selecting a fixed number of mixture components. Hence, the
obtained nonparametric Bayesian formulation eliminates the need to
perform inference (or make arbitrary choices) on the number of mixture
components (clusters) necessary to represent the modeled data.

An interesting alternative to the Dirichlet process prior for
nonparametric Bayesian modeling is the Pitman-Yor process (PYP) prior
[6]. Pitman-Yor processes produce power-law distributions that allow
for better modeling of populations comprising a high number of clusters
with low popularity and a low number of clusters with high popularity
[7]. Indeed, the Pitman-Yor process prior can be viewed as a
generalization of the Dirichlet process prior, and reduces to it for a
specific selection of its parameter values. In [8], a Gaussian
process-based coupled PYP method for joint segmentation of multiple
images is proposed.

A different perspective on the problem of nonparametric data modeling
was introduced in [9], where the authors proposed the kernel
stick-breaking process (KSBP). The KSBP imposes the assumption that
clustering is more probable if two feature vectors are close in a
prescribed (general) space, which may be associated explicitly with the
spatial or temporal position of the modeled data. This way, the KSBP is
capable of exploiting available prior information regarding the spatial
or temporal relations and dependencies between the modeled data.

S.P.C. and D.K. have equal contributions to this work.

Inspired by these advances, and motivated by the 
interesting properties of the PYP, in this paper we 
come up with a different approach towards predictor- 
dependent random probability measures for non- 
parametric Bayesian clustering. We first introduce an 
infinite sequence of random spatial or temporal loca- 
tions. Then, based on the stick-breaking construction of 
the Pitman-Yor process, we define a predictor-dependent 
random probability measure by considering that the 
discount hyperparameters of the Beta-distributed ran- 
dom weights (stick variables) of the process are not 
uniform among the weights, but controlled by a kernel 
function expressing the proximity between the location 
assigned to each weight and the given predictors. The 
obtained random probability measure is dubbed the 
kernel Pitman-Yor process (KPYP) for non-parametric 
clustering of data with general spatial or temporal in- 
terdependencies. We empirically study the performance 
of the KPYP prior in unsupervised image segmentation 
and text-dependent speaker identification, and compare 
it to the kernel stick-breaking process, and the Dirichlet 
process prior. 

The remainder of this paper is organized as follows: In Section 2, we
provide a brief presentation of the Pitman-Yor process, as well as the
kernel stick-breaking process and its desirable properties in
clustering data with spatial or temporal dependencies. In Section 3,
the proposed nonparametric prior for clustering data with temporal or
spatial dependencies is introduced, its relations to existing methods
are discussed, and an efficient variational Bayesian algorithm for
model inference is derived.

2 Theoretical Background 

2.1 The Pitman-Yor Process 

Dirichlet process (DP) models were first introduced by Ferguson [11].
A DP is characterized by a base distribution $G_0$ and a positive
scalar $\alpha$, usually referred to as the innovation parameter, and
is denoted as $\mathrm{DP}(\alpha, G_0)$. Essentially, a DP is a
distribution placed over a distribution. Let us suppose we randomly
draw a sample distribution $G$ from a DP, and, subsequently, we
independently draw $M$ random variables $\{\Theta_m^*\}_{m=1}^{M}$
from $G$:

$$G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0) \tag{1}$$
$$\Theta_m^* \mid G \sim G, \quad m = 1, \dots, M \tag{2}$$



Integrating out $G$, the joint distribution of the variables
$\{\Theta_m^*\}_{m=1}^{M}$ can be shown to exhibit a clustering effect.
Specifically, given the first $M-1$ samples of $G$,
$\{\Theta_m^*\}_{m=1}^{M-1}$, it can be shown that a new sample
$\Theta_M^*$ is either (a) drawn from the base distribution $G_0$ with
probability $\frac{\alpha}{\alpha + M - 1}$, or (b) selected from the
existing draws, according to a multinomial allocation, with
probabilities proportional to the number of the previous draws with
the same allocation [12]. Let $\{\Theta_c\}_{c=1}^{C}$ be the set of
distinct values taken by the variables $\{\Theta_m^*\}_{m=1}^{M-1}$.
Denoting as $f_c^{M-1}$ the number of values in
$\{\Theta_m^*\}_{m=1}^{M-1}$ that equal $\Theta_c$, the distribution of
$\Theta_M^*$ given $\{\Theta_m^*\}_{m=1}^{M-1}$ can be shown to be of
the form [12]

$$p(\Theta_M^* \mid \{\Theta_m^*\}_{m=1}^{M-1}, \alpha, G_0) = \frac{\alpha}{\alpha + M - 1} G_0 + \sum_{c=1}^{C} \frac{f_c^{M-1}}{\alpha + M - 1} \delta_{\Theta_c} \tag{3}$$

where $\delta_{\Theta_c}$ denotes the distribution concentrated at a
single point $\Theta_c$.

The Pitman-Yor process [6] functions similarly to the Dirichlet
process. Let us suppose we randomly draw a sample distribution $G$
from a PYP, and, subsequently, we independently draw $M$ random
variables $\{\Theta_m^*\}_{m=1}^{M}$ from $G$:

$$G \mid d, \alpha, G_0 \sim \mathrm{PY}(d, \alpha, G_0) \tag{4}$$

with

$$\Theta_m^* \mid G \sim G, \quad m = 1, \dots, M \tag{5}$$

where $d \in [0, 1)$ is the discount parameter of the Pitman-Yor
process, $\alpha > -d$ is its innovation parameter, and $G_0$ the base
distribution. Integrating out $G$, similar to Eq. (3), we now obtain

$$p(\Theta_M^* \mid \{\Theta_m^*\}_{m=1}^{M-1}, d, \alpha, G_0) = \frac{\alpha + dC}{\alpha + M - 1} G_0 + \sum_{c=1}^{C} \frac{f_c^{M-1} - d}{\alpha + M - 1} \delta_{\Theta_c} \tag{6}$$



As we observe, the PYP yields an expression for
$p(\Theta_M^* \mid \{\Theta_m^*\}_{m=1}^{M-1}, d, \alpha, G_0)$ quite
similar to that of the DP, also possessing the rich-gets-richer
clustering property, i.e., the more samples have been assigned to a
draw from $G_0$, the more likely subsequent samples will be assigned to
the same draw. Further, the more we draw from $G_0$, the more likely a
new sample will again be assigned to a new draw from $G_0$. These two
effects together produce a power-law distribution where many unique
$\Theta_m^*$ values are observed, most of them rarely [6]. In
particular, for $d > 0$, the number of unique values scales as
$\mathcal{O}(\alpha M^d)$, where $M$ is the total number of draws. Note
also that, for $d = 0$, the Pitman-Yor process reduces to the Dirichlet
process, in which case the number of unique values grows more slowly at
$\mathcal{O}(\alpha \log M)$ [13].
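The growth rates above can be explored with a small simulation. The following is a minimal sketch, not part of the original method description: it draws cluster assignments sequentially using the predictive rule of Eq. (6), where an existing cluster with count f is chosen with weight f - d and a new cluster with weight alpha + d*C. The function name `sample_pyp_partition` and the seed argument are illustrative assumptions.

```python
import random


def sample_pyp_partition(num_draws, alpha, d, seed=0):
    """Sequentially assign draws to clusters with the predictive rule of Eq. (6).

    Returns a list of cluster sizes. Setting d = 0 gives the Dirichlet
    process rule of Eq. (3). The function name is hypothetical.
    """
    rng = random.Random(seed)
    counts = []  # counts[c] = number of draws currently in cluster c
    for m in range(num_draws):  # m draws have been made so far
        # Unnormalised weights: existing clusters first, a new cluster last.
        weights = [f - d for f in counts] + [alpha + d * len(counts)]
        # The weights sum to alpha + m, the denominator of Eq. (6).
        r = rng.random() * (alpha + m)
        chosen = len(counts)  # default: open a new cluster
        acc = 0.0
        for c, w in enumerate(weights):
            acc += w
            if r < acc:
                chosen = c
                break
        if chosen == len(counts):
            counts.append(1)
        else:
            counts[chosen] += 1
    return counts


if __name__ == "__main__":
    for d in (0.0, 0.5):
        sizes = sample_pyp_partition(2000, alpha=1.0, d=d, seed=1)
        print(f"d={d}: {len(sizes)} clusters from {sum(sizes)} draws")
```

Running the sketch with several values of d gives a feel for how the discount parameter changes the number of occupied clusters; the exact counts depend on the random seed.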

A characterization of the (unconditional) distribution of the random
variable $G$ drawn from a PYP, $\mathrm{PY}(d, \alpha, G_0)$, is
provided by the stick-breaking construction of Sethuraman [14].
Consider two infinite collections of independent random variables
$v = (v_c)_{c=1}^{\infty}$ and $\{\Theta_c\}_{c=1}^{\infty}$, where the
$v_c$ are drawn from a Beta distribution, and the $\Theta_c$ are
independently drawn from the base distribution $G_0$. The
stick-breaking representation of $G$ is then given by

$$G = \sum_{c=1}^{\infty} \varpi_c(v) \delta_{\Theta_c} \tag{7}$$

where

$$p(v_c) = \mathrm{Beta}(v_c \mid 1 - d, \alpha + dc) \tag{8}$$
$$v = (v_c)_{c=1}^{\infty} \tag{9}$$
$$\varpi_c(v) = v_c \prod_{j=1}^{c-1} (1 - v_j) \in [0, 1] \tag{10}$$

and

$$\sum_{c=1}^{\infty} \varpi_c(v) = 1 \tag{11}$$
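As an illustration of Eqs. (7)-(11), the sketch below draws the mixture weights of a truncated stick-breaking representation. The truncation (forcing the last stick variable to 1 so that the finite set of weights sums to one) is an assumption made here to keep the example finite, and the function name `pyp_stick_breaking` is hypothetical.

```python
import random


def pyp_stick_breaking(alpha, d, truncation, seed=0):
    """Draw truncated Pitman-Yor stick-breaking weights.

    Each stick variable v_c follows Beta(1 - d, alpha + d * c) as in Eq. (8);
    the weight of component c is v_c * prod_{j<c} (1 - v_j) as in Eq. (10).
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for c in range(1, truncation + 1):
        if c == truncation:
            v = 1.0  # truncation assumption: last component takes the rest
        else:
            v = rng.betavariate(1.0 - d, alpha + d * c)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


if __name__ == "__main__":
    w = pyp_stick_breaking(alpha=1.0, d=0.3, truncation=20)
    print(sum(w))  # close to 1 by construction
```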



2.2 The Kernel Stick-Breaking Process 

An alternative to the above approaches, which takes into account
additional prior information regarding spatial or temporal
dependencies in the modeled datasets, is the kernel stick-breaking
process introduced in [9]. The basic notion in the formulation of the
KSBP is the introduction of a predictor-dependent prior, which promotes
clustering of adjacent data points in a prescribed (general) space.

Let us consider that the observed data points $y \in \mathcal{Y}$ are
associated with the positions $x \in \mathcal{X}$ where the
measurements were taken, arranged on a $D$-dimensional lattice. For
example, in cases of sequential data modeling, the observed data points
$y$ are naturally associated with a one-dimensional lattice that
depicts their temporal succession, i.e., the time points at which these
measurements were taken. In computer vision applications, we might be
dealing with observations $y$ measured at different locations in a
two-dimensional or three-dimensional space $\mathcal{X}$.

To take this prior information into account, the KSBP postulates that
the random process $G$ in (1) comprises a function of the predictors
$x$ related to the observable data points $y$, expressing their
location in the prescribed space $\mathcal{X}$. Specifically, it is
assumed that

$$G = \sum_{c=1}^{\infty} \varpi_c(v(x)) \delta_{\Theta_c} \tag{12}$$

where

$$\varpi_c(v(x)) = v_c(x, \Gamma_c; \psi_c) \prod_{j=1}^{c-1} \left(1 - v_j(x, \Gamma_j; \psi_j)\right) \in [0, 1] \tag{13}$$
$$v(x) = \left(v_c(x, \Gamma_c; \psi_c)\right)_{c=1}^{\infty} \tag{14}$$
$$v_c(x, \Gamma_c; \psi_c) = V_c \, k(x, \Gamma_c; \psi_c) \tag{15}$$
$$p(V_c) = \mathrm{Beta}(V_c \mid 1, \alpha) \tag{16}$$

and $k(x, \Gamma_c; \psi_c)$ is a kernel function centered at
$\Gamma_c$ with hyperparameter $\psi_c$.

By selecting an appropriate form of the kernel function
$k(x, \Gamma_c; \psi_c)$, the KSBP allows for obtaining prior
probabilities $\varpi_c(v(x))$ for the derived clusters that depend on
the values of the predictors (spatial or temporal locations) $x$.
Indeed, the closer the location $x$ of an observation $y$ is to the
location $\Gamma_c$ assigned to the $c$th cluster, the higher the prior
probability $\varpi_c(v(x))$ becomes. Thus, the KSBP prior promotes by
construction clustering of (spatially or temporally) adjacent data
points. For example, a typical selection for the kernel
$k(x, \Gamma_c; \psi_c)$ is the radial basis function (RBF) kernel

$$k(x, \Gamma_c; \psi_c) = \exp\left[-\frac{\|x - \Gamma_c\|^2}{\psi_c^2}\right] \tag{17}$$
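To make the location dependence concrete, the following sketch evaluates the KSBP weights of Eqs. (13)-(15) at a query location for a finite number of sticks. It assumes an RBF kernel in which the squared distance is divided by the squared width; that exact parameterization, the function names, and the fixed values passed for the variables V_c are illustrative assumptions.

```python
import math


def rbf_kernel(x, center, psi):
    """RBF kernel with values in (0, 1]; the width convention is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / psi ** 2)


def ksbp_weights(x, centers, psis, big_v):
    """Evaluate KSBP weights at location x for a finite set of sticks.

    big_v[c] plays the role of V_c in Eq. (15); in the model these are
    Beta(1, alpha) draws, here they are passed in for reproducibility.
    """
    weights = []
    remaining = 1.0
    for center, psi, v_big in zip(centers, psis, big_v):
        v = v_big * rbf_kernel(x, center, psi)  # Eq. (15)
        weights.append(v * remaining)           # Eq. (13)
        remaining *= 1.0 - v
    return weights


if __name__ == "__main__":
    centers = [(0.0, 0.0), (5.0, 5.0)]
    w = ksbp_weights((0.0, 0.0), centers, psis=[1.0, 1.0], big_v=[0.5, 0.5])
    print(w)  # the cluster centered at the query location dominates
```

With a finite number of sticks the weights need not sum to one; the leftover mass belongs to the sticks that were not evaluated.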



3 Proposed Approach 

3.1 Model Formulation 

We aim to obtain a clustering algorithm that takes into account the
prior information regarding the (temporal or spatial) adjacencies of
the observed data in the locations space $\mathcal{X}$, promoting
clustering of data adjacent in the space $\mathcal{X}$, and
discouraging clustering of data points relatively near in the feature
space $\mathcal{Y}$ but far in the locations space $\mathcal{X}$. For
this purpose, we seek to provide a location-dependent nonparametric
prior for clustering the observed data $y$.

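Motivated by the definition and the properties of the Pitman-Yor
process discussed in the previous section, to effect these goals, in
this work we introduce a random probability measure $G(x)$ under which,
given the first $M-1$ samples $\{\Theta_m^*\}_{m=1}^{M-1}$ drawn from
$G$, a new sample $\Theta_M^*$ associated with a measurement location
$x$ is distributed according to

$$p(\Theta_M^* \mid x; \{\Theta_m^*\}_{m=1}^{M-1}, k, \alpha, \hat{X}, G_0) = \frac{\alpha + \sum_{c=1}^{C} \left[1 - k(x, \hat{x}_c; \psi_c)\right]}{\alpha + M - 1} G_0 + \sum_{c=1}^{C} \frac{f_c^{M-1} + k(x, \hat{x}_c; \psi_c) - 1}{\alpha + M - 1} \delta_{\Theta_c} \tag{18}$$

where $f_c^{M-1}$ is the number of values in
$\{\Theta_m^*\}_{m=1}^{M-1}$ that equal $\Theta_c$,
$\{\Theta_c\}_{c=1}^{C}$ is the set of distinct values taken by the
variables $\{\Theta_m^*\}_{m=1}^{M-1}$, $G_0$ is the employed base
measure, $\hat{x}_c$ is the location assigned to the $c$th cluster,
$\hat{X} = \{\hat{x}_c\}_c$, and $k(\cdot, \hat{x}; \psi)$ is a bounded
kernel function taking values in the interval $[0, 1]$, such that

$$\lim_{\mathrm{dist}(x, \hat{x}) \to 0} k(x, \hat{x}; \psi) = 1 \tag{19}$$
$$\lim_{\mathrm{dist}(x, \hat{x}) \to \infty} k(x, \hat{x}; \psi) = 0 \tag{20}$$

Here, $\alpha$ is the innovation parameter of the process, constrained
to satisfy $\alpha > 0$, and $\mathrm{dist}(\cdot, \cdot)$ is the
distance metric used by the employed kernel function. We dub this
random probability measure $G(x)$ the kernel Pitman-Yor process, and we
denote

$$\Theta_m^* \mid x; G \sim G(x), \quad m = 1, \dots, M \tag{21}$$

with

$$G(x) \mid k, \alpha, \hat{X}, G_0 \sim \mathrm{KPYP}(x; k, \alpha, \hat{X}, G_0) \tag{22}$$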
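The predictive rule of Eq. (18) can be sketched in a few lines. In this hedged example, an existing cluster with count f and kernel value k at the query location receives mass (f - 1 + k) / (alpha + M - 1), and a new cluster receives (alpha + sum of (1 - k)) / (alpha + M - 1). The RBF kernel form, the function names, and the toy inputs are assumptions made for illustration only.

```python
import math


def rbf_kernel(x, center, psi):
    """Bounded kernel in (0, 1]; the RBF form is an illustrative assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / psi ** 2)


def kpyp_predictive(x, counts, centers, psis, alpha):
    """Return (p_new, p_existing) for a new draw at location x, per Eq. (18).

    counts[c] is the number of earlier draws in cluster c, centers[c] is the
    location assigned to cluster c, and psis[c] is its kernel hyperparameter.
    """
    ks = [rbf_kernel(x, ctr, psi) for ctr, psi in zip(centers, psis)]
    total = alpha + sum(counts)  # equals alpha + M - 1
    p_new = (alpha + sum(1.0 - k for k in ks)) / total
    p_existing = [(f - 1.0 + k) / total for f, k in zip(counts, ks)]
    return p_new, p_existing


if __name__ == "__main__":
    p_new, p_old = kpyp_predictive(
        (0.0, 0.0), counts=[5, 5],
        centers=[(0.0, 0.0), (3.0, 3.0)], psis=[1.0, 1.0], alpha=1.0,
    )
    print(p_new, p_old)
```

Two clusters with equal counts receive different probabilities here purely because of their distance to the query location, which is the behavior the kernel-dependent discount is meant to produce.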

The stick-breaking construction of the KPYP $G(x)$ follows directly
from the above definition (18) and the relevant discussions of
Section 2. Considering a KPYP $G(x)$ with cluster locations set
$\hat{X} = \{\hat{x}_c\}_{c=1}^{\infty}$, kernel function
$k(\cdot, \hat{x}; \psi)$ satisfying the constraints (19) and (20), and
innovation parameter $\alpha$, we have

$$G(x) = \sum_{c=1}^{\infty} \varpi_c(v(x)) \delta_{\Theta_c} \tag{23}$$

where

$$v_c(x) \sim \mathrm{Beta}\left(k(x, \hat{x}_c; \psi_c), \alpha + c\left[1 - k(x, \hat{x}_c; \psi_c)\right]\right) \tag{24}$$

and

$$\varpi_c(v(x)) = v_c(x) \prod_{j=1}^{c-1} (1 - v_j(x)) \in [0, 1] \tag{25}$$
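A hedged sketch of the construction in Eqs. (23)-(25) follows: for a query location, each stick variable is drawn from a Beta distribution whose first parameter is the kernel value k and whose second parameter is alpha + c(1 - k). The truncation (last stick set to 1), the small floor on the kernel value that keeps the Beta parameter strictly positive, the RBF kernel form, and all names are assumptions introduced for the example.

```python
import math
import random


def rbf_kernel(x, center, psi):
    """Bounded kernel in (0, 1]; the RBF form is an illustrative assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / psi ** 2)


def kpyp_stick_weights(x, centers, psis, alpha, seed=0):
    """Draw truncated KPYP weights at location x following Eqs. (24)-(25)."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    last = len(centers)
    for c, (ctr, psi) in enumerate(zip(centers, psis), start=1):
        # Floor the kernel value so the Beta shape parameter stays positive.
        k = max(rbf_kernel(x, ctr, psi), 1e-6)
        if c == last:
            v = 1.0  # truncation assumption: last stick takes the remainder
        else:
            v = rng.betavariate(k, alpha + c * (1.0 - k))
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


if __name__ == "__main__":
    lattice = [(0.0,), (1.0,), (2.0,), (3.0,)]
    w = kpyp_stick_weights((1.0,), lattice, [1.0] * 4, alpha=1.0)
    print(w, sum(w))
```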

Proposition 1. The stochastic process $G(x)$ defined in (23)-(25) is a
valid random probability measure.
Proof. We need to show that

$$\sum_{c=1}^{\infty} \varpi_c(v(x)) = 1 \tag{26}$$

For this purpose, we follow an approach similar to [9]. From (25), we
have

$$1 - \sum_{c=1}^{C-1} \varpi_c(v(x)) = \prod_{c=1}^{C-1} \left[1 - v_c(x)\right] \tag{27}$$

Then, taking logs on both sides of (27) and letting $C \to \infty$, we
have

$$\sum_{c=1}^{\infty} \varpi_c(v(x)) = 1 \ \text{ if and only if } \ \sum_{c=1}^{\infty} \log\left[1 - v_c(x)\right] = -\infty \tag{28}$$

Based on the Kolmogorov three-series theorem, the summation on the
right is over independent random variables and is equal to $-\infty$ if
and only if
$\sum_{c=1}^{\infty} \mathbb{E}\left[\log\left(1 - v_c(x)\right)\right] = -\infty$.
However, $v_c(x)$ follows a Beta distribution, which means
$v_c(x) \in [0, 1]$, thus $\log\left[1 - v_c(x)\right] \le 0$, and
hence its expectation is negative; thus, the condition is satisfied,
and (26) holds true.
3.2 Relation to the KSBP

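Indeed, the proposed KPYP shares some common ideas with the KSBP of
[9]. The KSBP considers that

$$G(x) = \sum_{c=1}^{\infty} \varpi_c(v(x)) \delta_{\Theta_c} \tag{29}$$

where

$$\varpi_c(v(x)) = v_c(x) \prod_{j=1}^{c-1} (1 - v_j(x)) \in [0, 1] \tag{30}$$
$$v_c(x) = V_c \, k(x, \hat{x}_c; \psi_c) \tag{31}$$
$$p(V_c) = \mathrm{Beta}(V_c \mid 1, \alpha) \tag{32}$$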



From this definition, we observe that there is a key difference between
the KPYP and the KSBP: the KSBP multiplies stick variables sharing the
same Beta prior with a bounded kernel function centered at a location
$\hat{x}$ unique for each stick, to obtain a predictor
(location)-dependent random probability measure. Instead, the KPYP
considers stick variables with different Beta priors, with the prior of
each stick variable employing a different "discount hyperparameter,"
defined as a bounded kernel centered at a location $\hat{x}$ unique for
each stick. This way, the KPYP controls the assignment of observations
to clusters by discounting clusters the centers of which are too far
from the clustered data points in the locations space $\mathcal{X}$.

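It is interesting to compute the mean and variance of the stick
variables $v_c(x)$ for these two stochastic processes, for a given
observation location $x$ and cluster center $\hat{x}_c$. In the case of
the KPYP, we have

$$\mathbb{E}[v_c(x)] = \frac{k(x, \hat{x}_c; \psi_c)}{k(x, \hat{x}_c; \psi_c) + \alpha_c} \tag{33}$$
$$\mathbb{V}[v_c(x)] = \frac{k(x, \hat{x}_c; \psi_c) \, \alpha_c}{\left(k(x, \hat{x}_c; \psi_c) + \alpha_c\right)^2 \left(k(x, \hat{x}_c; \psi_c) + \alpha_c + 1\right)} \tag{34}$$

where

$$\alpha_c = \alpha + c \left(1 - k(x, \hat{x}_c; \psi_c)\right) \tag{35}$$

On the contrary, for the KSBP we have

$$\mathbb{E}[v_c(x)] = \frac{k(x, \hat{x}_c; \psi_c)}{1 + \alpha} \tag{36}$$
$$\mathbb{V}[v_c(x)] = \frac{k(x, \hat{x}_c; \psi_c)^2 \, \alpha}{(1 + \alpha)^2 (\alpha + 2)} \tag{37}$$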

From (33) and (36), we observe that, for a given observation location
$x$ and cluster center $\hat{x}_c$, the same increase in the value of
the kernel function $k(x, \hat{x}_c; \psi_c)$ induces a much greater
increase in the expected value of the stick variable $v_c(x)$ employed
by the KPYP compared to the increase in the expectation of the stick
variable $v_c(x)$ employed by the KSBP. Hence, the predictor
(location)-dependent prior probabilities of cluster assignment of the
KPYP appear to vary more steeply with the employed kernel function
values compared to the KSBP.
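For readers who wish to inspect this comparison numerically, the sketch below evaluates the stick-variable moments of the two processes from the Beta-distribution formulas: for the KPYP the stick follows Beta(k, alpha + c(1 - k)), and for the KSBP it is k times a Beta(1, alpha) variable. The function names and the chosen values of alpha, c, and k are illustrative assumptions.

```python
def kpyp_stick_moments(k, alpha, c):
    """Mean and variance of a KPYP stick variable, Beta(k, alpha_c)."""
    alpha_c = alpha + c * (1.0 - k)
    mean = k / (k + alpha_c)
    var = k * alpha_c / ((k + alpha_c) ** 2 * (k + alpha_c + 1.0))
    return mean, var


def ksbp_stick_moments(k, alpha):
    """Mean and variance of a KSBP stick variable, k * Beta(1, alpha)."""
    mean = k / (1.0 + alpha)
    var = k ** 2 * alpha / ((1.0 + alpha) ** 2 * (alpha + 2.0))
    return mean, var


if __name__ == "__main__":
    alpha, c = 1.0, 5  # hypothetical settings for the comparison
    for k in (0.25, 0.5, 0.75, 1.0):
        print(k, kpyp_stick_moments(k, alpha, c)[0], ksbp_stick_moments(k, alpha)[0])
```

When the kernel value equals 1 the two sets of moments coincide, since both stick variables then follow Beta(1, alpha); how they differ for smaller kernel values depends on the cluster index c.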



3.3 Variational Bayesian Inference 

Inference for nonparametric models can be conducted under a Bayesian
setting, typically by means of variational Bayes (e.g., [15]) or Monte
Carlo techniques (e.g., [16]). Here, we prefer a variational Bayesian
approach, due to its lower computational cost. For this purpose, we
additionally impose a Gamma prior over the innovation parameter
$\alpha$, with

$$p(\alpha) = \mathcal{G}(\alpha \mid \eta_1, \eta_2). \tag{38}$$



Let us consider a set of observations $Y = \{y_n\}_{n=1}^{N}$ with
corresponding locations $X = \{x_n\}_{n=1}^{N}$. We postulate for our
observed data a likelihood function of the form

$$p(y_n \mid z_n = c) = p(y_n \mid \theta_c) \tag{39}$$

where the hidden variables $z_n$ are defined such that $z_n = c$ if the
$n$th data point is considered to be derived from the $c$th cluster. We
impose a multinomial prior over the hidden variables $z_n$, with

$$p(z_n = c \mid x_n) = \varpi_c(v(x_n)) \tag{40}$$

where the $\varpi_c(v(x))$ are given by (25), with the prior over the
$v_c(x)$ given by (24). We also impose a suitable conjugate exponential
prior over the likelihood parameters $\theta_c$.

Our variational Bayesian inference formalism consists in the derivation
of a family of variational posterior distributions $q(\cdot)$ which
approximate the true posterior distribution over
$\{z_n\}_{n=1}^{N}$, $\{v(x_n)\}_{n=1}^{N}$,
$\{\theta_c\}_{c=1}^{\infty}$, and the innovation parameter $\alpha$.
Apparently, under this infinite-dimensional setting, Bayesian inference
is not tractable. For this reason, we fix a value $C$ and we let the
variational posterior over the $v_c(x)$ have the property
$q(v_C(x) = 1) = 1$, $\forall x \in \mathcal{X}$, i.e., we set
$\varpi_c(v(x))$ equal to zero for $c > C$,
$\forall x \in \mathcal{X}$.

Let $W = \{\alpha, \{z_n\}_{n=1}^{N}, \{(v_c(x_n))_{n=1}^{N}, \theta_c\}_{c=1}^{C}\}$
be the set of the parameters of our truncated model over which a prior
distribution has been imposed, and $\Xi$ be the set of the
hyperparameters of the model, comprising the $\{\psi_c\}_{c=1}^{C}$ and
the hyperparameters of the priors imposed over the innovation parameter
$\alpha$ and the likelihood parameters $\theta_c$ of the model.
Variational Bayesian inference consists in the derivation of an
approximate posterior $q(W)$ by maximization (in an iterative fashion)
of the variational free energy

$$\mathcal{L}(q) = \int \mathrm{d}W \, q(W) \log \frac{p(X, Y, W \mid \Xi)}{q(W)} \tag{41}$$



Having considered a conjugate exponential prior configuration, the
variational posterior $q(W)$ is expected to take the same functional
form as the prior, $p(W)$ [17].
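The variational free energy of our model reads

$$\begin{aligned}
\mathcal{L}(q) = {} & \int \mathrm{d}\alpha \, q(\alpha) \Bigg\{ \log \frac{p(\alpha \mid \eta_1, \eta_2)}{q(\alpha)} + \sum_{c=1}^{C-1} \sum_{n=1}^{N} \int \mathrm{d}v_c(x_n) \, q(v_c(x_n)) \log \frac{p(v_c(x_n) \mid \alpha)}{q(v_c(x_n))} \Bigg\} \\
& + \sum_{c=1}^{C} \int \mathrm{d}\theta_c \, q(\theta_c) \log \frac{p(\theta_c)}{q(\theta_c)} + \sum_{c=1}^{C} \sum_{n=1}^{N} q(z_n = c) \\
& \times \Bigg\{ \int \mathrm{d}v(x_n) \, q(v(x_n)) \log p(z_n = c \mid x_n) - \log q(z_n = c) + \int \mathrm{d}\theta_c \, q(\theta_c) \log p(y_n \mid \theta_c) \Bigg\}
\end{aligned} \tag{42}$$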



3.4 Variational Posteriors 



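Let us denote as $\langle \cdot \rangle$ the posterior expectation of a
quantity. We have

$$q(v_c(x_n)) = \mathrm{Beta}(v_c(x_n) \mid \beta_{c,n}, \hat{\beta}_{c,n}) \tag{43}$$

where

$$\beta_{c,n} = k(x_n, \hat{x}_c; \psi_c) + \sum_{m: x_m = x_n} q(z_m = c) \tag{44}$$
$$\hat{\beta}_{c,n} = \langle \alpha \rangle + c \left[1 - k(x_n, \hat{x}_c; \psi_c)\right] + \sum_{m: x_m = x_n} \sum_{c'=c+1}^{C} q(z_m = c') \tag{45}$$

and

$$q(\alpha) = \mathcal{G}(\alpha \mid \hat{\eta}_1, \hat{\eta}_2) \tag{46}$$

where

$$\hat{\eta}_1 = \eta_1 + N(C - 1) \tag{47}$$
$$\hat{\eta}_2 = \eta_2 - \sum_{c=1}^{C-1} \sum_{n=1}^{N} \left[\psi(\hat{\beta}_{c,n}) - \psi(\beta_{c,n} + \hat{\beta}_{c,n})\right] \tag{48}$$

$\psi(\cdot)$ denotes the Digamma function, and

$$\langle \alpha \rangle = \frac{\hat{\eta}_1}{\hat{\eta}_2} \tag{49}$$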

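Further, the cluster assignment variables yield

$$q(z_n = c) \propto \exp\left(\langle \log \varpi_c(v(x_n)) \rangle\right) \exp\left(\varphi_{nc}\right) \tag{50}$$

where

$$\langle \log \varpi_c(v(x_n)) \rangle = \sum_{c'=1}^{c-1} \langle \log(1 - v_{c'}(x_n)) \rangle + \langle \log v_c(x_n) \rangle \tag{51}$$
$$\varphi_{nc} = \langle \log p(y_n \mid \theta_c) \rangle_{q(\theta_c)} \tag{52}$$

and

$$\langle \log v_c(x_n) \rangle = \psi(\beta_{c,n}) - \psi(\beta_{c,n} + \hat{\beta}_{c,n}) \tag{53}$$
$$\langle \log(1 - v_c(x_n)) \rangle = \psi(\hat{\beta}_{c,n}) - \psi(\beta_{c,n} + \hat{\beta}_{c,n}) \tag{54}$$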
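The update of the cluster assignment posteriors in Eqs. (50)-(54) can be sketched for a single data point as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Digamma function is approximated with a standard recurrence plus asymptotic series (Python's standard library has none), the Beta parameters and expected log-likelihoods are passed in as plain lists, and the last stick is treated as fixed at 1 under the truncation. All function names are hypothetical.

```python
import math


def digamma(x):
    """Approximate the Digamma function for x > 0.

    Uses psi(x) = psi(x + 1) - 1/x to shift x above 6, then an asymptotic
    expansion; accuracy is roughly 1e-8, enough for illustration.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def expected_log_weights(beta, beta_hat):
    """Expected log mixture weights, Eq. (51), for C = len(beta) + 1 clusters."""
    e_log_v = [digamma(b) - digamma(b + bh) for b, bh in zip(beta, beta_hat)]
    e_log_1mv = [digamma(bh) - digamma(b + bh) for b, bh in zip(beta, beta_hat)]
    e_log_v.append(0.0)  # truncation: the last stick equals 1, so its log is 0
    out = []
    acc = 0.0  # running sum of expected log(1 - v_c') for c' < c
    for c, term in enumerate(e_log_v):
        out.append(acc + term)
        if c < len(e_log_1mv):
            acc += e_log_1mv[c]
    return out


def responsibilities(beta, beta_hat, expected_log_lik):
    """Normalised q(z_n = c) from Eq. (50), computed stably in log space."""
    scores = [w + l for w, l in zip(expected_log_weights(beta, beta_hat), expected_log_lik)]
    top = max(scores)
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    print(responsibilities([1.0, 1.0], [1.0, 1.0], [0.0, 0.0, 0.0]))
```

Subtracting the maximum score before exponentiating is a common numerical safeguard and does not change the normalised result.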



Regarding the parameters $\theta_c$, we obtain

$$\log q(\theta_c) \propto \log p(\theta_c) + \sum_{n=1}^{N} q(z_n = c) \log p(y_n \mid \theta_c) \tag{55}$$

Finally, regarding the model hyperparameters $\Xi$, we obtain the
hyperparameters of the employed kernel functions $\psi_c$ by
maximization of the lower bound $\mathcal{L}(q)$, and we heuristically
select the values of the rest.

3.5 Learning the cluster locations $\hat{x}_c$

The locations assigned to the obtained clusters, $\hat{x}_c$, can be
determined either by random selection or by maximization of the
variational free energy $\mathcal{L}(q)$ over them. The latter
procedure can be conducted by means of any appropriate iterative
maximization algorithm; here, we employ the popular L-BFGS algorithm
for this purpose. Both random $\hat{x}_c$ selection and estimation by
means of variational free energy optimization, using the L-BFGS
algorithm, shall be evaluated in the experimental section of our paper.